Assessing the accuracy of California county level COVID-19 hospitalization forecasts to inform public policy decision making

Background The COVID-19 pandemic has highlighted the role of infectious disease forecasting in informing public policy. However, significant barriers remain for effectively linking infectious disease forecasts to public health decision making, including a lack of model validation. Forecasting model performance and accuracy should be evaluated retrospectively to understand under which conditions models were reliable and could be improved in the future. Methods Using archived forecasts from the California Department of Public Health’s California COVID Assessment Tool (https://calcat.covid19.ca.gov/cacovidmodels/), we compared how well different forecasting models predicted COVID-19 hospitalization census across California counties and regions during periods of Alpha, Delta, and Omicron variant predominance. Results Based on mean absolute error estimates, forecasting models had variable performance across counties and through time. When accounting for model availability across counties and dates, some individual models performed consistently better than the ensemble model, but model rankings still differed across counties. Local transmission trends, variant prevalence, and county population size were informative predictors for determining which model performed best for a given county based on a random forest classification analysis. Overall, the ensemble model performed worse in less populous counties, in part because of fewer model contributors in these locations. Conclusions Ensemble model predictions could be improved by incorporating geographic heterogeneity in model coverage and performance. Consistency in model reporting and improved model validation can strengthen the role of infectious disease forecasting in real-time public health decision making. Supplementary Information The online version contains supplementary material available at 10.1186/s12889-023-15649-0.


Background
In public health, forecasting has been used to predict infectious disease dynamics for a variety of diseases including influenza, dengue fever, Ebola virus disease, Zika fever, and most recently COVID-19, which has highlighted the importance of infectious disease modeling to help inform public health decision making [1]. Nevertheless, significant barriers remain for effectively linking infectious disease forecasts with public health decision making including a lack of model standardization and validation, and difficulty in successfully communicating model complexity and uncertainty [2]. Moreover, public health practitioners may need different outcomes or indicators than what forecast models provide [2,3].
In June 2020, as part of the COVID-19 response, the California Department of Public Health's (CDPH) COVID Modeling Team launched the California Communicable diseases Assessment Tool (CalCAT) to compile available COVID-19 models, mostly from academic groups, to inform policy and public health action [4]. Cal-CAT provides nowcasts (R-effective estimates), forecasts (short-term predictions for hospitalizations, ICU admissions, and deaths), and longer-range scenario models for a variety of COVID indicators at the state, regional, and county scales. Some contributors are national, forecasting for all states, while others focus only on California and may not be publicly available elsewhere ( Table 1). The models on CalCAT have been used throughout the COVID-19 pandemic to evaluate current transmission trends and prospective hospital and intensive care unit capacity. This information combined with other evidence and policy considerations has helped to inform the implementation of stay-at-home orders and statewide mask mandates (e.g., reinstating a mask mandate during the emergence of Omicron/BA.1). In addition, models combined with other data streams were used to inform metrics for the Blueprint for a Safer Economy including the nation's first health equity metric and to support planning for vaccine allocation and distribution [5].
During the COVID-19 pandemic response, many California local health jurisdictions communicated the importance of forecasts focused on the relevant scale of decision making (e.g., county-vs. state-level forecasts) because there was significant geographic heterogeneity in COVID-19 outcomes at regional and local levels [9]. A better understanding of how forecasting models have captured these geographical heterogeneities could help inform local public health decision making during future COVID-19 waves and enable local health jurisdictions to employ models judiciously given proven past performance. Lessons learned from COVID-19 forecasting efforts can also be applied to future modeling for other diseases including influenza.
We retrospectively evaluated archived forecasting predictions from CalCAT for models that consistently provided county-level hospitalization census predictions across a year long period from February 1, 2021 to February 1, 2022 (Table 1). Hereafter, we will use hospital census to refer to the number of patients currently hospitalized with confirmed COVID-19 for a given county and date. To explore the effects of COVID-19 variants on model performance within that period, we also compared forecasting model accuracy during three phases of the COVID-19 pandemic at the county and regional level in California (

Methods
Multiple methods exist for measuring epidemic forecast accuracy including metrics that evaluate specific point estimates and uncertainty [10]. When full predictive estimates are available, metrics like the logarithmic score or continuous ranked probability score (CRPS) provide context for probabilistic models' predictions and uncertainty. When forecasts are provided in quantile or interval formats, the weighted interval score (WIS) is a potential alternative [10]. Since not all models incorporated into CalCAT provided full predictive or interval estimates, or did so with different reporting standards, we focused on the median point estimates (50th percentile) from We utilized mean absolute error (MAE) and relative error at the 7-, 14-, and 21-day forecast horizons to evaluate the accuracy of these point estimates. To better compare across counties with different population sizes, we normalized both error types by the median hospital capacity of each county (14-day horizon results are highlighted in the main text; the remaining forecast horizons are provided in the Supporting Information). From the MAE scores, we computed a standardized ranking score for every forecasted observation relative to other models issuing a prediction for that same publication date and  [11]. In addition, we also conducted pairwise tournaments of model performance to control for the frequency of model participation. Finally, using a classification regression approach, we explored which countylevel epidemiological and socio-economic covariates could help explain the "winning" model for a given location and date based on the lowest MAE scores for a given forecast horizon.
County results are grouped by health officer regions, which are contiguous groupings of 58 counties used for health mandates in California (Fig. 1D): Association of Bay Area Health Officers (ABAHO); Greater Sacramento Region Health Officers (GSRHO); Rural Association of Northern California Health Officers (RANCHO); San Joaquin Valley Consortium (SJVC); and Southern California (SCAL). Some counties do not have major hospitals and therefore lack forecasting predictions, actual numbers of hospitalizations, or both. For this reason, Alpine, Sierra, and Sutter Counties are not included in the analyses that follow.

Mean absolute error
The raw mean absolute error (MAE) for each publication date i with associated target end dates k is calculated as: where N is the number of days into the future that the forecast is made, x i,k is the prediction made on publication date i for target end date k and x k is the actual observed value for a given target end date [11]. We then standardized the MAE by h , the median nonsurge hospital capacity of a given county: MAE h .100 Median hospital capacity was chosen for standardization because the hospital capacity for facilities, and aggregated for counties, changes through time based on staffing and other factors. Note that not all model forecasts were available for all counties or all dates. A model only received an MAE score for a given publication date if it had predictions available for the target end dates of interest (e.g., to receive a 7-day MAE score, a model must have made predictions for 1-7 days ahead of the publication date). Here we used CA-state specific data [12] for post-hoc evaluation, whereas many model teams may be relying on other data sources (e.g., U.S. Department of Health and Human Services) for fitting or calibration.

Relative error
The relative error for each publication date i with associated target end dates k is calculated as: where N is the number of days into the future that the forecast is made, x i,k is the prediction made on publication date i for target end date k and x k is the actual observed value for a given target end date [11]. We then standardized the relative error by h , the median non-surge hospital capacity of a given county: .100 Therefore, a positive relative error corresponds to a model overestimating the hospital census, while a negative relative error corresponds to an underestimation.

Standardized ranking score
For each publication date i and location j, we calculated a standardized rank for every available model m based on its associated MAE: sr m,i,j = 1 − r m,i,j −1 n i,j where r m,i,j is the ranking of the MAE of model m out of the n other models that made predictions for publication date i and location j (adapted slightly from [11]). Thus, for a given publication date i and location j , the highest possible standardized ranking score for any given model is 1 and the lowest is the inverse of the lowest possible ranking 1/n i,j . Models not participating for a given publication date i and location j receive a zero, and thus, are penalized for lack of coverage.

Pairwise tournament
To conduct a pairwise tournament, we calculated a relative MAE for each pair of models m and m ′ : where θ m,m′ is the median of the ratio of the simultaneously available MAE scores for model m to model m ′ with shared publication dates i , target end dates m , and locations j [13]. Importantly, the common locations, publication dates, and observation dates may differ for each pair of models m and m ′ . This approach varies slightly from some previous examples, as the order of operations is scale then aggregate rather than aggregate then scale [11,14].
An overall performance score of a given model, m is then calculated as the geometric mean of all relative where M is the total number of all models available for comparison. At the county level for counties with smaller hospital capacities, there was a non-trivial probability of certain models achieving an MAE of zero, which leads to relative MAE scores of zero or infinity depending on the order of comparison. To eliminate these irregularities in the pairwise comparisons, we excluded counties with median nonsurge hospital capacities ≤ 25 (i.e., Calaveras, Lassen, Mariposa, Modoc, Mono, San Benito, and Trinity).

Random forest classification analysis
To explore whether the model with the lowest MAE score for a given location and observation date could be explained by county-specific epidemiological or socioeconomic factors, we conducted a random forest classification analysis. Random forest analysis is a recursive partitioning method that improves classification accuracy by synthesizing the predictions from many classification trees [15,16]. The response variable (i.e., classification label) was the best performing model for a given county and date combination based on the lowest MAE score of the available models. We explored the covariates (i.e., features) of: progressive vaccination coverage at the county level, county-level R-effective, 7-day change in countylevel R-effective, variant prevalence at the health officer region level [17], county population size, percent of county residents in poverty (2019), percent unemployment (2020), median income (2019), five-year average percentage completing college (university degree) (2015-2019), and 2013 Rural-Urban Continuum code. All socioeconomic variables were taken from U.S. Department of Agriculture Economic Research Service county-level data sets [18]. For pre-processing, data were centered and scaled using the caret package [19]. For model training and tuning, 70% of original data was used with K-fold validation (four-fold, repeated four times). The final accuracy of the random forest classification models were 61% with mtry = 7, 66% with mtry = 7, and 68% with mtry = 7 for 7, 14, and 21 day forecast horizons respectively.

Data and code availability
The forecasts and R-effective values analyzed in this paper are available from CalCAT [4]. California-specific hospitalization data is available on the California Open Data Portal [12]. Because of reporting delays and backfilling, datasets used in the analysis may represent a snapshot of what was available at that point in time. All data and code used for analysis and figure generation is available in the public repository: https:// doi. org/ 10. 5281/ zenodo. 78512 80. Analyses were performed in R (v 3.6.0) [20].

Model performance varied across locations and under different periods of variant predominance
Model performance was heterogeneous across counties and during different periods of variant predominance (Figs. 2A, 3A and 4A), in part reflecting that the number of models available for a given publication date and location varied through time; fewer models were available during the Omicron variant period and for less populous health officer regions such as RANCHO (Supplementary Figures 3, 11, 15). For example, in Trinity Countyone of California's least populous counties -the Simple Growth model had the lowest 14-day normalized MAE for most forecast publication dates during the Alpha and Omicron predominant periods ( Figs. 2A and 4A), whereas the Columbia model had the lowest 14-day normalized MAE during the Delta period (Fig. 3A). In San Diego County, California's second most populous county, the LEMMA model had the lowest 14-day normalized MAE during the Alpha period ( Fig. 2A), and the COVID NearTerm model had the lowest 14-day MAE for the most days during Delta and Omicron periods (Fig. 3A  and 4A). Overall, the Simple Growth model performed particularly well in the RANCHO region during the Omicron period as demonstrated by a lower 14-day MAE for many counties in that region (Fig. 4A). The LEMMA model had the lowest 14-day MAE across many regions during the Omicron period on or after January 13, 2022 (Fig. 4A). In general, the range of the relative error distributions increased with longer time horizons and during the Omicron period ( Supplementary Figs. 1-2). During the Omicron period, most relative error distributions were right skewed with median relative error values less than zero, indicating a tendency for underprediction, but a non-zero probability of sizeable overprediction (Supplementary Fig. 1).
The sum of the standardized rank score (�sr m,i,j ) in each county, j, rewards both performance (model accuracy) and frequent participation (model coverage). During the Alpha period, the LEMMA model had the highest score in 20/55 counties and, the Ensemble model was a close second with the highest score in 17/55 counties (Fig. 2B). During the Delta period, the Ensemble model had the highest score in 21/55 counties (Fig. 3B). During the Omicron period, the Ensemble model had the highest score in 22/55 counties, and the Simple Growth model was a close second with 20/55 counties (Fig. 4B).
The density distributions of standardized rank sr m,i,j allow for comparison of model performance while controlling for frequency of model participation (Figs. 2C, 3C and 4C). Although COVID NearTerm did not have the highest sum of the standardized rank score in any counties, it had the highest median standardized rank score during the Alpha and Delta periods (Figs. 2C and 3C). The LEMMA model had the highest median standardized rank score during the Omicron period (Fig. 4C). The same pattern of ranking was present for 7-day MAE ( Supplementary Figs. 4-6). For 21-day MAE, the COVID NearTerm model had the highest median standardized rank score during the Alpha and Omicron periods (Supplementary Figs. 16 & 18), while the LEMMA model had the highest median rank scores during the Delta period ( Supplementary Fig. 17).

When controlling for participation, some models outperformed the ensemble, but pairwise model rankings varied across counties
When matching across all locations and all observation dates, two models-COVID NearTerm and LEMMA-performed better in pairwise comparisons relative to the Ensemble model for 14-day MAE (Fig. 5A  & B). However, pairwise rankings were quite variable when disaggregated by county and also highlighted the differences in coverage and availability across locations for different models (Fig. 5C). For example, although the Simple Growth model came fourth in the overall pairwise ranking (Fig. 5A), it came first in twelve individual counties. Similarly, the Columbia model came last in the overall pairwise ranking (Fig. 5A) and generally performed Fig. 2 Forecasting accuracy results at the county level during the Alpha wave in California as measured by mean absolute error (MAE). A Heat map of the best daily performing model for a given prediction date as measured by 14-day MAE. Each cell in the heat map corresponds to a normalized MAE calculated for the day that a model forecast was published. Counties are grouped into panels by California health officer regions. B A summary map of California where the color of the county corresponds to the model with the highest sum of the standardized rank score for that period (�sr m,i,j ) . Note that by using the summation of the standardized ranking score, models are penalized for lack of participation. C A density distribution of the standardized rank score (sr m,i,j ) that depicts the median (dashed) and mean (solid) as vertical lines for each model distribution. A standardized rank score of one indicates that a model came in first relative to other participating models for a given date and location, values closer to zero indicate that a model had a lower ranking compared to other participating models, and a value of zero corresponds to no participation worse than average (θ m > 1) , but performed better than average (θ m < 1) in Plumas and Inyo counties (Fig. 5C).
Overall pairwise rankings were robust to forecast horizon length for the complete analysis period  Forecasting accuracy results at the county level during the Delta wave in California as measured by mean absolute error (MAE). A Heat map of the best daily performing model for a given prediction date as measured by 14-day MAE. Each cell in the heat map corresponds to a standardized MAE calculated for the day that a model forecast was published. Counties are grouped into panels by California health officer regions. B A summary map of California where the color of the county corresponds to the model with the highest sum of the standardized rank score for that period (�sr m,i,j ) . Note that by using the summation of the standardized ranking score models are penalized for lack of participation. C A density distribution of the standardized rank score (sr m,i,j ) that depicts the median (dashed) and mean (solid) as vertical lines for each model distribution. A standardized rank score of one indicates that a model came in first relative to other participating models for a given date and location, values closer to zero indicate that a model had a lower ranking compared to other participating models, and a value of zero corresponds to no participation

Epidemiological traits, county population size, and variant traits best predicted forecast "winners"
For the entire analysis period (February 1, 2021-February 1, 2022), time-varying vaccine coverage at the county-level, local transmission dynamics (R-effective and 7-day change in R-effective), county population size, and regional proportion of variants, were most important in predicting which model had the lowest MAE for a given county on a given publication date ( Supplementary Fig. 23). Other static socio-economic variables like income, percent unemployment, percent of residents with a university degree, and percent of Fig. 4 Forecasting accuracy results at the county level during the Omicron wave in California as measured by mean absolute error (MAE). A Heat map of the best daily performing model for a given prediction date as measured by 14-day MAE. Each cell in the heat map corresponds to a standardized MAE calculated for the day that a model forecast was published. Counties are grouped into panels by California health officer regions. B A summary map of California where the color of the county corresponds to the model with the highest sum of the standardized rank score for that period (�sr m,i,j ) . Note that by using the summation of the standardized ranking score models are penalized for lack of participation. C A density distribution of the standardized rank score (sr m,i,j ) that depicts the median (dashed) and mean (solid) as vertical lines for each model distribution. A standardized rank score of one indicates that a model came in first relative to other participating models for a given date and location, values closer to zero indicate that a model had a lower ranking compared to other participating models, and a value of zero corresponds to no participation residents in poverty were less important for predicting model outcomes. These variable importance rankings were robust to the forecast horizon used for MAE calculations ( Supplementary Fig. 23).

Less populated counties have ensemble predictions with higher median MAE and more variable MAE
When comparing 14-day MAE normalized by hospital capacity, counties with smaller population sizes typically had a higher median MAE score and more variable MAE distributions compared to more populous counties ( Supplementary Fig. 24B). Based on a linear regression, the logarithmic of the normalized MAE score was negatively correlated with county population size coefficient estimate : 1.7 · 10 −7 ; p-value < 2 · 10 −16 . This relationship held true regardless of the forecast horizon used for MAE calculations ( Supplementary Fig. 24 A & C).

Ensemble model could be improved by incorporating geographic heterogeneity in model coverage and performance
Echoing other analyses of COVID-19 forecast performance that have described a large variation in model accuracy by location [11,21], forecasting models performed differentially across California counties and regions and for different periods of variant predominance during the COVID pandemic (Figs. 2, 3, 4 and 5C, Supplementary  Figs. 4-6, 16-18). Moreover, location-specific features like local transmission dynamics or county population size helped explain model performance ( Supplementary  Fig. 24). This geographic variation in model performance points to the importance of location-specific model evaluation in order for local health jurisdictions to best employ forecasts for public health decision making.
In general, combining multiple models into ensembles allows for better performance [11,20,[22][23][24]. However, in this case, COVID NearTerm and LEMMA consistently outperformed the Ensemble model when controlling for frequency of participation (Figs. 2, 3, 4C and 5), although pairwise ranking scores remained variable at the county level (Fig. 5C). The higher performance of individual models over the Ensemble model combined with the variability in performance at the county-level suggests that the Ensemble model does not have to be applied uniformly across all locations; public health decision making could benefit from model selection and ensemble weighting that reflects location-specific past performance as well as local transmission trends [25].

Lower forecast coverage in less populated counties weakens evidence-based decision making
One interesting question from a public health decision making context is whether model coverage (i.e., frequent issuing of forecasts across all potential locations) and model accuracy should be weighed equally when establishing the criteria for a "winning" forecast. In this analysis, there was typically a mismatch between raw model performance based on availability as measured by the sum of standardized ranking (Figs. 2, 3 and 4) and model performance when controlling for participation via pairwise tournaments (Fig. 5). In part, this disagreement reflects that not all models provided estimates for all counties, especially for less populous regions or counties (Supplementary Figs. 1,9,13). For example, the COVID NearTerm model ranked first in the pairwise ranking but provided no coverage for any counties in the less populous RANCHO region (Fig. 5C). In contrast, the Ensemble model came first in the majority of counties during the Delta and Omicron periods as measured by sum of the standardized rank score for that period sr m,i,j (Figs. 3B and 4B), but was generally outperformed in pairwise ranking evaluations both overall and for individual counties (Fig. 5). Although the Ensemble model in less populous counties exhibited a higher median normalized MAE and a more variable normalized MAE regardless of the forecast horizon ( Supplementary Fig. 22), this observation may be a direct result of calculating MAE from median point estimates rather than accounting for forecast uncertainty, since stochastic effects likely contribute more significantly to the forecast predictions for counties with smaller population sizes.
While maximizing model accuracy is important, a forecast cannot add value if it is not available for decision making. As county-level contributors are lost to attrition, ensemble estimates may further decrease in accuracy or may not be possible in these less populous counties. Policy and public health decision makers should evaluate what investments or innovations in modeling are needed to improve results for underserved counties with lower forecast coverage. In addition, decision makers could seek to incentivize the best-performing models to serve smaller counties that neither have the resources to do this work inhouse nor have academic partners readily available. The lack of coverage in smaller counties also points to the inherent complexities of interpreting in hospitalization burden-since hospitalizations are typically recorded via hospital location rather than patient residency [26]. As others have suggested, forecasting at the geographic unit of hospital referral networks could be another solution to low model coverage in less populous counties [26].

Continuity of contributors, forecast structure, and documentation helps real-time public health decision making and post-hoc analysis
The overall continuity of forecasting contributions has proved challenging for post-hoc evaluation. Although CalCAT has had roughly ten unique forecast contributors through time, many of these groups have ceased contributing as the COVID-19 landscape has increased in complexity (e.g., emerging variants, prior immunity, boosters). Although less relevant to forecasting hospitalizations, changes in case ascertainment and testing practices make retrospective analyses more challenging. Interruptions to forecast continuity can also limit post-hoc evaluation. For example, some modeling groups paused forecasts in order to reset or recalibrate for new variants like Omicron.
While initiatives like the COVID-19 Forecast Hub have worked to standardize forecast output and reporting [11], one additional challenge for this analysis was that the reporting across external forecast contributors differed. For example, across three of the externally contributed forecasts they all produced interval estimates at different cutoff points: COVID Nearterm (10, 20, 30, 40, 50, 60, 70, 80, and 90 percentiles), Columbia (2.5, 25, 50, 75 and 97.5 percentiles), and LEMMA (5, 50, 95 percentiles). This discrepancy precluded the use of more robust measures like CRPS or WIS and means that that our results are much more sensitive to the median point estimates [10]. Changing repository structures, file nomenclature, and data formatting can also disrupt the archiving process necessary for ensemble generation and subsequent post-hoc review. This analysis is a snapshot of what was available on Cal-CAT-and therefore to the general public and public health decision makers-and may not entirely reflect what model contributors would intend to be their contributing forecast at all times. The CalCAT team updated data and results iteratively as often as possible, but not all model changes were announced. As the COVID-19 pandemic necessitated rapid changes in data reporting and data infrastructure, other information technology issues may have introduced unintended errors.
The current classification regression analysis in this manuscript does not include model-specific traits. In order to truly evaluate whether underlying model traits and assumptions help to explain performance for specific locations through time, it would be necessary to have a larger number of forecasting contributors and consistent metadata on both the changes in model construction and the timing of those changes. Therefore, another potential area of documentation might include not just existing model assumptions and structure but how those characteristics have changed over time. This analysis may be easier to do at a state or national scale, where more model contributors are available, and reporting is better standardized through initiatives like the Forecast and Scenario Hubs.
Reporting and communicating infectious disease forecasting results, with all their inherent uncertainty and complexity, remain areas for improvement and growth for public health departments and their academic and industry collaborators to support evidence-based public health policy planning and decision making. Importantly, forecasting models may also serve as a communication tool to influence behavior change by the general public. One phenomenon not explored in this analysis is the potential for forecasts to alter human behavior, and subsequently model accuracy.

Conclusions
Major progress in infectious disease forecasting has been made during the COVID-19 pandemic, while ongoing challenges, such as those around data and communication, have persisted. We retrospectively investigated hospitalization census forecast model performance at the county level in the state of California. Model performance and ranking varied through space and time and by metric, highlighting the difficulty of making blanket recommendations for which models to use for individual counties, including an ensemble approach. Calibrating based on past model performance may help improve ensemble forecast generation, and counties may benefit by considering which individual model contributors have historically served them the best. Going forward, closer collaboration between forecasters, researchers, and policymakers may create positive feedback loops that inform the ongoing COVID-19 response and other future public health action.